lncRNA–disease association prediction method based on the nearest neighbor matrix completion model

State-of-the-art medical studies proved that long noncoding ribonucleic acids (lncRNAs) are closely related to various diseases. However, their large-scale detection in biological experiments is problematic and expensive. To aid screening and improve the efficiency of biological experiments, this study introduced a prediction model based on the nearest neighbor concept for lncRNA–disease association prediction. We used a new similarity algorithm in the model that fused potential associations. The experimental validation of the proposed algorithm proved its superiority over the available Cosine, Pearson, and Jaccard similarity algorithms. Satisfactory results in the comparative leave-one-out cross-validation test (with AUC = 0.96) confirmed its excellent predictive performance. Finally, the proposed model’s reliability was confirmed by performing predictions using a new dataset, yielding AUC = 0.92.


Scientific Reports
| (2022) 12:21653 | https://doi.org/10.1038/s41598-022-25730-0 www.nature.com/scientificreports/
interactome network-based approach to explore and forecast the latent interaction between lncRNAs and miRNAs. They constructed graphs based on lncRNA-miRNA similarity and combined known interactions to calculate scores as predicted outcomes. Chen et al. 32 elaborated a new computational model named "Neighborhood Constraint Matrix Completion for MiRNA-Disease Association Prediction" (NCMCMDA) to predict potential miRNA-disease associations. They integrated a neighborhood constraint with matrix completion, providing a novel way of using similarity information to assist the prediction. Immediately afterward, Chen et al. 33 developed a deep-belief network model for miRNA-disease association prediction (DBNMDA). Compared with previous supervised models, DBNMDA used the information of all miRNA-disease pairs during the pretraining process. This reduced, to some extent, the impact of too few known associations on prediction accuracy. Fan et al. 34 developed the IDHI-MIRW approach to predict potential lncRNA-disease associations based on a large-scale lncRNA-disease heterogeneous network. It used three lncRNA-related data types (lncRNA expression profiles, lncRNA-miRNA interactions, and lncRNA-protein interactions) to form three lncRNA similarity networks and three types of disease-related information (disease semantic similarity, disease-miRNA associations, and disease-gene associations) to form three disease similarity networks. The lncRNA topological similarity networks, disease topological similarity networks, and known lncRNA-disease bipartite graphs were combined to construct large-scale lncRNA-disease heterogeneous networks. Then, the candidate lncRNAs for each query disease were prioritized using the RWRH algorithm. Alternatively, Sudipto et al. 35 proposed ranking lncRNAs using network diffusion (LION).
This network diffusion approach integrated lncRNA, protein-protein, and disease-protein networks to prioritize important lncRNAs in diseases. First, they combined lncRNA-protein, protein-protein, and disease-protein networks into a multilevel complex network (triple network). Next, they applied a random walk network diffusion algorithm. The proximity of lncRNAs to disease genes was measured based on the probability of connecting edges. Which lncRNA was associated with a given disease was determined based on the probability of accessibility in the heterogeneous network. A model called DWLMI was introduced by Yang et al. 36. They inferred the potential associations between lncRNAs and miRNAs by representing them as vectors via a lncRNA-miRNA-disease-protein-drug graph. Other models also combine protein and miRNA data to build heterogeneous networks. For example, Zhou et al. 37 introduced a novel computational method to predict lncRNA-disease associations. They integrated associations between microRNAs (miRNAs), lncRNAs, proteins, drugs, and diseases to construct a heterogeneous network and then trained predictive models with a rotating forest classifier. Alternatively, Yuan et al. 38 developed a machine-learning approach named LGDLDA. They computed similarity matrices from multivariate data and then integrated the neighborhood information in the similarity matrix using nonlinear feature learning of neural networks. Finally, LGDLDA ranked candidate lncRNA-disease pairs and then selected potential disease-related lncRNAs. Similarly, Li et al. 39 proposed an approach called DF-MDA. They constructed a heterogeneous network by integrating various known associations between miRNAs, diseases, proteins, long noncoding RNAs (lncRNAs), and drugs. They then classified miRNA-disease associations using a random forest classifier.
Notably, circular RNAs and metabolites have also been found to be closely linked to disease development and could serve as complementary data for lncRNA-disease studies 40,41.
This paper proposes a prediction method based on the matrix completion technique, inspired by recommender systems. Matrix completion is a common strategy in recommendation systems, and collaborative filtering algorithms are one form of it. There are two kinds of collaborative filtering algorithms: memory-based and model-based. Memory-based collaborative filtering is heuristic: it uses similarities as weights and nearest neighbors to fill in the missing values of the user-item matrix, thereby predicting user needs and making recommendations. It includes both user-based and item-based algorithms. Model-based collaborative filtering, such as the latent factor model and matrix factorization, rests on matrix completion theory, an extension of compressed sensing theory which states that a low-rank and sparse matrix can be restored to a complete matrix with high accuracy 42. The user-item matrix in recommendation systems is typically low-rank and sparse, so this theory can restore a complete matrix with no missing values, simulate each user's scores, and recommend high-scoring items. Since the latent factor model and matrix factorization have low interpretability and a high time cost on large-scale data, this paper proposes a two-layer multi-weighted nearest-neighbor prediction model that, like memory-based collaborative filtering, assigns weights to neighbors in order to reassign values in the target matrix. The target matrix is the lncRNA-disease adjacency matrix: known lncRNA-disease associations are marked as 1 at the corresponding position, while unknown relationships are marked as 0.
The magnitude of each reassigned matrix element represents the strength of the predicted association between the lncRNA and the disease; a higher value indicates a stronger association. Researchers can therefore select the lncRNAs with high scores for biological experiments, narrowing the scope of the experiments and improving research efficiency. This model provides a reliable solution to the prediction problem for sparse data. When the data are extremely sparse, the accuracy of the similarity calculation is improved by incorporating related data, which enables the model to achieve satisfactory prediction results. In this paper, known associations accounted for only about 0.1% of the entries in the lncRNA-disease matrix. The AUC value of the fivefold cross-validation experiment exceeded 0.94 once the related datasets were used to assist the similarity calculation. The code and experimental data are publicly available at https://github.com/nrgz/DMWNN-data.

Materials
This study integrated three datasets: the lncRNA-disease relationship dataset, the miRNA-lncRNA relationship dataset, and the miRNA-disease relationship dataset. These were taken from the MNDR v2.0, starBase v2.0, and HMDD databases, respectively. After comparing and removing duplicate values, we obtained 1089 lncRNAs, 373 diseases, and 246 miRNAs, as shown in Table 1.
The lncRNA-disease, miRNA-lncRNA, and miRNA-disease relationships were used to construct the adjacency matrices LD, ML, and MD. The lncRNA-disease relationships were extracted by merging and removing duplicate values from LD, ML, and MD to form the target matrix Y. In Y, an element was set to 1 if the corresponding lncRNA was associated with the corresponding disease and to 0 otherwise. Y was a matrix of 1089 rows and 373 columns containing 407 nonzero entries. Detailed data are given in the Supplementary Information (Supplementary Information 1, 2 and 3).
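As a minimal sketch of how such a target matrix could be assembled from a list of association pairs, the Python snippet below builds a binary adjacency matrix and computes the density implied by the reported dimensions. The function name `build_adjacency` and the placeholder identifiers (`lnc_a`, `disease_x`, and so on) are illustrative assumptions and do not come from the published dataset.

```python
import numpy as np

def build_adjacency(pairs, row_names, col_names):
    """Binary adjacency matrix: 1 for a known association, 0 for unknown."""
    row_index = {name: i for i, name in enumerate(row_names)}
    col_index = {name: j for j, name in enumerate(col_names)}
    Y = np.zeros((len(row_names), len(col_names)), dtype=np.int8)
    for row_name, col_name in set(pairs):  # set() drops duplicate records
        Y[row_index[row_name], col_index[col_name]] = 1
    return Y

# Toy input with one duplicated record (placeholder names, not real data).
pairs = [("lnc_a", "disease_x"), ("lnc_b", "disease_y"), ("lnc_a", "disease_x")]
Y_toy = build_adjacency(pairs, ["lnc_a", "lnc_b", "lnc_c"], ["disease_x", "disease_y"])

# Density implied by the reported size of Y (1089 x 373 with 407 nonzero entries).
density = 407 / (1089 * 373)
print(Y_toy)
print(f"density of Y: {density:.4%}")
```

The density works out to roughly one known association per thousand entries, which is why the similarity calculation later draws on the miRNA data as well.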

Method
Similarity calculation method with potential association attributes. In previous similarity calculations, the vectors {0, 0, 0, 0} and {1, 1, 1, 1} in the adjacency matrix were often treated as unrelated, where 1 and 0 represented proven and unproven associations, respectively. However, a 0 entry only means that an association has not yet been proven, so it may later become a 1. Based on this assumption, a similarity calculation method incorporating the potential association property was proposed, in which the data initially considered irrelevant were given weights to participate in the calculation. The specific algorithm is described by Eq. (1), where λ is the weight parameter, Γ is a vector with the same dimensions as X and Y whose elements are all 1, and X and Y are vectors with the same dimensions whose elements are 0 or 1. X × Y is the exterior product between vectors X and Y. The result is a vector.
LncRNA similarity. The LMD matrix, with lncRNAs as rows and miRNAs and diseases as columns, was constructed from the LD, ML, and MD matrices, and the similarity matrix S_l was calculated according to Eq. (2),
where LMD_i and LMD_j denote the i-th and j-th rows of the matrix LMD, respectively, Γ_l is a vector with the same dimension as LMD_i whose elements are all 1, and λ is the weight parameter.
Disease similarity. The DML matrix, with diseases as rows and miRNAs and lncRNAs as columns, was constructed from the LD, ML, and MD matrices. The similarity matrix S_d was calculated according to Eq. (3),
where DML_i and DML_j denote the i-th and j-th rows of the matrix DML, respectively, Γ_d is a vector with the same dimension as DML_i whose elements are all 1, and λ is the weight parameter.
Double multi-weighted nearest neighbor model. The double multi-weighted nearest neighbor (DMWNN) model was inspired by the memory-based collaborative filtering algorithm. Unlike a recommendation algorithm, a model that predicts potential associations between lncRNAs and diseases does not need to distinguish whether the main body is a user vector or an item vector; it only needs to mine the associations between lncRNAs and diseases as fully as possible. Therefore, the DMWNN model fills new values for the 0 entries in the matrix from both the row-vector and the column-vector perspectives and fuses the two filling results into the final result. Figure 1 illustrates the construction process of the single-layer model. The steps of model construction were as follows:
Step 1. Construct the index matrix K based on the similarity. Taking S_l as an example, the k largest values in row i of the matrix S_l are placed into row i of the matrix K_l in descending order, where S′_l is the matrix obtained by sorting each row of S_l in descending order, and S′_l(i, j) identifies the row of S_l that ranks j-th in similarity to the i-th row.
Step 2. Different weights are assigned to objects at different distances, with high weights for close objects and low weights for distant ones. This model uses a linearly decreasing weight assignment, in which the weight of the t-th nearest neighbor depends on k and t, where k is the number of nearest neighbors, ω is the distance weight, and t is the rank of the neighbor.
Step 3. The row vectors in the target matrix Y are processed according to Eq. (6).
New values are filled for the 0 entries in each row to obtain the matrix Y_1.
Step 4. The column vectors in the target matrix Y are processed according to Eq. (7).
New values are filled for the 0 entries in each column to obtain the matrix Y_2.
Step 5. The matrix Y_1 is fused with the matrix Y_2 according to Eq. (8) to obtain the matrix Y_0,
where η_1 and η_2 are the weight parameters. In this model, η_1 and η_2 are both set to 0.5.
Step 6. The row vectors of the matrix Y_0 are processed according to Eq. (9).
New values are filled for the 0 entries in each row to obtain the matrix Y
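The first five steps can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the exact neighbor weights and filling rules are defined by the paper's equations, whereas the sketch assumes a linear weight of the form (k − t + 1)/k and fills each 0 entry with a weighted average of the neighbors' values, weighted by distance weight times similarity. All function names are hypothetical; only the fusion weights η_1 = η_2 = 0.5 are taken from the text.

```python
import numpy as np

def top_k_neighbors(S, k):
    """Step 1: for each row, indices of the k most similar other rows (requires k < n)."""
    masked = S.astype(float).copy()
    np.fill_diagonal(masked, -np.inf)  # a row is not its own neighbor
    return np.argsort(-masked, axis=1, kind="stable")[:, :k]

def linear_weights(k):
    """Step 2 (assumed form): weights decrease linearly from 1 to 1/k with rank."""
    return (k - np.arange(k)) / k

def fill_rows(Y, S, k):
    """Steps 3/4 (assumed form): replace 0 entries by a weighted neighbor average."""
    K = top_k_neighbors(S, k)
    w = linear_weights(k)
    filled = Y.astype(float).copy()
    for i in range(Y.shape[0]):
        neighbors = K[i]
        coeff = w * S[i, neighbors]  # distance weight times similarity
        total = coeff.sum()
        if total > 0:
            estimate = coeff @ Y[neighbors] / total
            zero = Y[i] == 0
            filled[i, zero] = estimate[zero]  # known 1 entries are kept
    return filled

def single_layer(Y, S_l, S_d, k_l, k_d, eta1=0.5, eta2=0.5):
    """Step 5: fuse the row-wise (lncRNA) and column-wise (disease) fillings."""
    Y1 = fill_rows(Y, S_l, k_l)      # row perspective
    Y2 = fill_rows(Y.T, S_d, k_d).T  # column perspective
    return eta1 * Y1 + eta2 * Y2

# Toy example: 4 lncRNAs x 3 diseases with made-up similarity matrices.
Y = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1]])
S_l = np.array([[1.0, 0.8, 0.1, 0.0],
                [0.8, 1.0, 0.5, 0.0],
                [0.1, 0.5, 1.0, 0.2],
                [0.0, 0.0, 0.2, 1.0]])
S_d = np.array([[1.0, 0.6, 0.0],
                [0.6, 1.0, 0.1],
                [0.0, 0.1, 1.0]])
Y0 = single_layer(Y, S_l, S_d, k_l=2, k_d=2)
print(np.round(Y0, 3))
```

In this toy case some entries remain 0 after one pass, which mirrors the motivation for stacking layers when Y is very sparse.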

Results and discussion
Cross-validation. Cross-validation is a standard method for model training when the amount of data is insufficient. Usually, model training requires splitting the data into a training set, a test set, and a validation set. This implies that the training set has less data than the original data, and the validation set can contain only part of the original data. The cross-validation method can use all the data for both training and validation. For example, the fivefold cross-validation method splits the data into five parts, takes one part as the validation set and the rest as the training set each time, and repeats the experiment five times. Using the average performance of the five experiments as the model performance under the current parameters also reduces the risk of overfitting. The final measure of the proposed method's quality is the "area under the curve" (AUC) value 43, defined as the area under the receiver operating characteristic (ROC) curve. The false positive rate (FPR, 1 − specificity) is the abscissa of the ROC curve, and the true positive rate (TPR, sensitivity) is its ordinate. The calculation formulas for FPR and TPR are given in Eqs. (12) and (13), respectively:
$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{12}$$
$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{13}$$
where TP and FP are the numbers of samples predicted positive correctly and incorrectly, respectively. Similarly, TN and FN are the numbers of samples predicted negative correctly and incorrectly, respectively.
Similarity metric evaluation. The Cosine, Pearson, and Jaccard similarity coefficients were selected for comparison in this study's performance evaluation experiments. As the accuracy of similarity algorithms could not be obtained by direct comparison, each similarity algorithm was used separately in the prediction model, and the merits of the similarity algorithms were judged by the performance of their respective models.
Since the DMWNN two-layer model based on the Cosine similarity failed to fully meet the requirement of assigning values to all zero terms, the three-layer nearest neighbor model was used to evaluate the performance. The fivefold cross-validation method was chosen to represent the model's predictive performance by the average performance obtained five times.
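For reference, the three baseline coefficients can be computed on binary association vectors as in the following Python sketch. These are the standard textbook definitions rather than code from the study, and the convention of returning 0 when a denominator vanishes is an assumption made here for convenience.

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity; defined as 0 here if either vector is all zeros."""
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom else 0.0

def pearson(x, y):
    """Pearson correlation; defined as 0 here if either vector is constant."""
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.linalg.norm(xc) * np.linalg.norm(yc)
    return float(xc @ yc / denom) if denom else 0.0

def jaccard(x, y):
    """Jaccard index of the sets of positions holding a 1."""
    union = np.logical_or(x, y).sum()
    return float(np.logical_and(x, y).sum() / union) if union else 0.0

x = np.array([1, 1, 0, 0, 0, 0])
y = np.array([1, 0, 0, 0, 0, 0])
empty = np.zeros(6, dtype=int)

print(cosine(x, y), pearson(x, y), jaccard(x, y))
print(cosine(empty, empty), jaccard(empty, empty))
```

Two rows with no known associations receive a similarity of 0 from Cosine and Jaccard because shared 0 entries contribute nothing, which is the situation the λ-weighted similarity is designed to address.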
In the fivefold cross-validation experiments, we manually adjusted the parameters many times based on the results of each experiment to obtain the best performance for each model. The improved calculation method and the Cosine, Pearson, and Jaccard similarity coefficients reached their optimal performance at k = {217, 268, 276, 323}, with AUC values of {0.9477, 0.9399, 0.9385, 0.8930}, respectively. Figure 3 shows that the improved similarity calculation method outperformed all other methods under study.
We continuously adjusted the parameters through fivefold cross-validation experiments according to the above performance trends so that the model corresponding to each value of λ achieved its best performance. Experimental verification showed that the respective optima were reached at k = {300, 217, 220, 260, 259, 263, 262, 373, 373, 372} (see Table 2). The trend of performance fluctuation is shown in Fig. 4. The highest AUC value, 0.9477, was reached at the weight parameter λ = 0.1, which gave the model's best performance.
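The fivefold protocol used in these experiments can be sketched in Python as below. The sketch assumes that the known associations (the 1 entries of Y) are shuffled, split into five groups, and hidden one group at a time, and it computes the AUC through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative). The function names and the toy matrix are illustrative assumptions.

```python
import numpy as np

def auc_from_scores(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count one half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

def five_fold_splits(Y, n_folds=5, seed=0):
    """Yield (training matrix, held-out positions); held-out 1 entries are set to 0."""
    rng = np.random.default_rng(seed)
    ones = np.argwhere(Y == 1)
    rng.shuffle(ones)  # shuffles the (row, column) pairs in place
    for held_out in np.array_split(ones, n_folds):
        Y_train = Y.copy()
        Y_train[held_out[:, 0], held_out[:, 1]] = 0
        yield Y_train, held_out

# Toy matrix with 10 known associations: each fold hides 2 of them.
Y = np.zeros((4, 5), dtype=int)
Y[[0, 0, 1, 1, 2, 2, 3, 3, 0, 1], [0, 1, 1, 2, 2, 3, 3, 4, 4, 0]] = 1
for Y_train, held_out in five_fold_splits(Y):
    print(int(Y_train.sum()), held_out.tolist())

print(auc_from_scores([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

In a full experiment, the model would be run on each `Y_train`, the scores at the held-out positions and at the unknown positions would be passed to `auc_from_scores`, and the five AUC values would be averaged.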
To evaluate this model more comprehensively, we used a broader range of evaluation criteria, including accuracy (Acc.), sensitivity (Sen.), specificity (Spec.), precision (Prec.), and the Matthews correlation coefficient (MCC). The prediction performance is listed in Table 3. The average Acc., Sen., Spec., Prec., MCC, and AUC values were 91.64, 92.01, 91.65, 1.21, 9.80 and 93.82%, respectively, when using the proposed method to predict lncRNA-disease associations. The standard deviations of these values were 2.11, 3.31, 2.12, 0.48, 2.16 and 1.91%, respectively. Although the model had low scores in Prec. and MCC, on balance it was a reliable predictor. At the same time, the low standard deviations of these metrics implied that the proposed model was robust and stable.
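The low precision and MCC are largely what extreme class imbalance would produce at this sensitivity and specificity. The following Python arithmetic is a back-of-the-envelope check, assuming (our assumption, not a statement from the study) that the fraction of positives in the evaluated entries equals the overall density of Y, 407 / (1089 × 373).

```python
import math

def precision_and_mcc(sensitivity, specificity, prevalence):
    """Precision and MCC from rates, with confusion-matrix cells as fractions of all samples."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    precision = tp / (tp + fp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return precision, mcc

prevalence = 407 / (1089 * 373)  # assumed share of positives
precision, mcc = precision_and_mcc(0.9201, 0.9165, prevalence)
print(f"precision ~ {precision:.2%}, MCC ~ {mcc:.2%}")
```

Under this assumption the precision comes out at about 1% and the MCC at roughly 0.1, the same order as the reported 1.21% and 9.80%: with about one positive per thousand entries, even an 8% false positive rate yields far more false positives than true positives.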
Multilayer model comparison. Since the target matrix Y was too sparse, the single-layer model could not assign values to all 0 entries even when the number of nearest neighbors k was set to its maximum. Therefore, a multilayer model was adopted, in which the processing is stacked layer by layer. The more layers are stacked, the smaller the minimum k value needed to meet the requirement. This implies that the maximum k value that can be selected for the next stacking also becomes smaller. If k is no less than 3, the model detects that no zero entries remain in the matrix Y after five stacking passes, and the sixth pass is skipped. Experimentally, the minimum k value of the 5-layer model was 3, and the maximum k value at which the stacked model continued executing was 2. At k equal to 1 or 2, the stacking had to contain more than five layers to meet the assignment requirement. However, stacking more than five layers was not considered, in order to keep the model's complexity low and its generalization ability high. The same fivefold cross-validation method was used, and the average performance over the five folds was used to represent the model's prediction performance. The parameters were manually adjusted to achieve the best performance for each multilayer model based on the results of each experiment. The two-layer model was experimentally verified to give the optimal performance. The performance of each model is given in Table 4. Figure 5 shows that the two-layer model outperformed all other models, so it was chosen as the final prediction model. The prediction performance deteriorated as the number of layers increased, probably because each layer's prediction was an iteration on the previous layer's result, which made the forecasts increasingly unrealistic.
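The stacking logic with early stopping can be sketched in Python as follows. The function `stack_layers` is a hypothetical illustration of the control flow described above (stop as soon as no 0 entries remain, never exceed five layers). The `toy_layer` function is a deliberately simple stand-in for one nearest-neighbor pass and is not the DMWNN layer itself.

```python
import numpy as np

def stack_layers(Y, layer_fn, max_layers=5):
    """Apply layer_fn repeatedly; stop once no 0 entries remain or max_layers is reached."""
    current = np.asarray(Y, dtype=float)
    for depth in range(1, max_layers + 1):
        current = layer_fn(current)
        if not np.any(current == 0):
            return current, depth  # later passes are skipped
    return current, max_layers

def toy_layer(M):
    """Stand-in layer: each 0 entry takes half of its larger horizontal neighbor."""
    left = np.roll(M, 1, axis=1)
    left[:, 0] = 0
    right = np.roll(M, -1, axis=1)
    right[:, -1] = 0
    estimate = 0.5 * np.maximum(left, right)
    out = M.copy()
    zero = M == 0
    out[zero] = estimate[zero]
    return out

result, depth = stack_layers(np.array([[1, 0, 0, 0]]), toy_layer)
print(result, depth)
```

In the toy run, information spreads one position per pass, so a single known entry needs three passes to reach every position. Each later value is derived from an earlier estimate rather than from observed data, which is consistent with the explanation given above for why deeper stacks predict less reliably.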
Performance comparison with previous models. Testing this model and the LFMP 45, CFNBC 18, and NBCLDA 17 models with leave-one-out cross-validation (LOOCV) on the same dataset gave AUC values of {0.9603, 0.8694, 0.8565, 0.8519}, respectively. The AUC value of the DMWNN model proposed in this paper clearly exceeded those of the other models, demonstrating the best prediction performance. The ROC and AUPR comparison charts based on LOOCV are plotted in Fig. 6.
To better examine the model's predictive performance, we used a new dataset for comparison with other models. The results are shown in Table 5. The data were collected from Lnc2Cancer, LncRNADisease, GeneRIF, HMDD (v2.0), and starBase. In total, they contained 240 lncRNAs, 495 miRNAs, and 412 diseases. The AUC of DMWNN reached 0.923, exceeding those of the other models on the tested data. In particular, this AUC value exceeded that of SIMCLDA 46 by 24%, MFLDA 47 by 47%, LDAP 11 by 7%, and Ping's method 48 by 6%. Moreover, DMWNN achieved an AUPR of 0.340, outperforming all other techniques in the comparison. Specifically, it outperformed SIMCLDA by 258%, MFLDA by 415%, LDAP by 105%, and Ping's method by 55%, demonstrating its strong prediction ability.
Case study. We selected four common cancers (namely, stomach neoplasm, lung neoplasm, colorectal neoplasm, and glioma) to analyze the actual prediction performance of the proposed model. After processing the lncRNA-disease adjacency matrix with the DMWNN model, the lncRNA scores in the columns of these cancers were ranked in the final prediction matrix, and the top twenty lncRNAs were selected for validation. The prediction results were checked against the literature, through the PubMed index, and against the LncRNADisease database. After examination, 19 of the 20 lncRNAs predicted to be associated with colorectal tumors were validated, while 18 of the 20 lncRNAs predicted to be associated with glioma were validated, as shown in Fig. 7. In the case of gastric and lung cancers, nearly half of the predicted associations were confirmed by recent literature despite the absence of relevant data in the database. The prediction results are shown in Fig. 8. The case analysis indicates that the DMWNN model proposed in this paper has high prediction accuracy.
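The ranking step of the case study can be sketched in Python as follows. The function name `top_candidates` is hypothetical, and skipping lncRNAs whose association with the disease is already recorded in Y is an assumption made here; the text does not state whether known associations were excluded from the top twenty.

```python
import numpy as np

def top_candidates(scores, known, disease_col, n=20):
    """Indices of the n highest-scoring lncRNAs for one disease column."""
    column = scores[:, disease_col].astype(float).copy()
    is_known = known[:, disease_col] == 1
    column[is_known] = -np.inf  # assumption: already-known associations are skipped
    order = np.argsort(-column, kind="stable")
    n_unknown = int((~is_known).sum())
    return order[: min(n, n_unknown)]

# Toy example: 4 lncRNAs, 1 disease; lncRNA 0 is already known to be associated.
scores = np.array([[0.9], [0.2], [0.7], [0.4]])
known = np.array([[1], [0], [0], [0]])
print(top_candidates(scores, known, disease_col=0, n=2))
```

The returned indices would then be mapped back to lncRNA names and checked against the literature and the database.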

Conclusions and model limitations
Recent research on long noncoding ribonucleic acids (lncRNAs) revealed their involvement in numerous human life activities and a key role in many pathological processes. While many biological experiments have explored the relationship between lncRNAs and diseases, it is still necessary to develop effective predictive models to assist biological experiments and improve experimental efficiency. This study adopted a simple and effective two-layer nearest neighbor model based on a similarity algorithm incorporating potential associations, which is suited to data represented as an adjacency matrix. Unlike other algorithms, it assigns weights to data initially judged to be unrelated so that they take part in the similarity calculation. This similarity algorithm was experimentally verified to outperform several comparable algorithms and is the core of the proposed two-layer nearest neighbor model: it screens the neighbors based on their degree of similarity, a crucial component of the prediction score. The other components of the score were the distance and the distance weights between neighbors. The multilayer model was designed so that all unknown data receive predictions. Since too many layers would bias the predictions, it was experimentally verified that two layers gave the best performance. The difference in performance produced by different datasets was evident in the comparison experiments. The first comparison experiment introduced miRNA data into the similarity calculation, thus improving its accuracy. The results showed that the proposed model provided more accurate predictions when the amount of data was sufficient. The prediction model relies heavily on the similarity algorithm, and the accuracy of the similarity calculation in turn depends on the amount of data.
Therefore, the proposed model is highly sensitive to the data, and the prediction results may vary significantly from one dataset to another. Moreover, the similarity calculation requires closely related data: the closer the relationship, the more accurate the similarity calculation. However, few related data are usually available for the lncRNAs or target diseases, which degrades prediction performance.
In the follow-up study, we envisage combining miRNAs and proteins. Since lncRNAs generally interact with miRNAs and proteins to participate in various human life activities, the degree of their association is relatively high, and these data can be combined to improve the model performance. Finally, our similarity calculation method is not yet complete and can only predict whether an lncRNA is related to a disease, which is still far from screening out lncRNAs that are truly involved in disease formation. Given that lncRNAs have become critical regulators of cancer pathways and biomarkers of various diseases, we also intend to design more reasonable similarity calculation methods from gene expression and survival data to improve the prediction accuracy and to use the results in targeted cancer therapy.